An Evaluation of Delphi



FRED WOUDENBERG 



ABSTRACT 

The literature concerning quantitative applications of the Delphi method is reviewed. No evidence was 
found to support the view that Delphi is more accurate than other judgment methods or that consensus in a 
Delphi is achieved by dissemination of information to all participants. Existing data suggest that consensus is 
achieved mainly by group pressure to conformity, mediated by the statistical group response that is fed back 
to all participants. 



Introduction 

Human judgment is necessary in situations of uncertainty and, because these situ- 
ations abound, is very much relied upon. However, numerous reports about flaws in 
human judgment and its inferiority to more formal methods of judgment have appeared 
[1-6]. Research since the beginning of this century has sought to discover the factors 
responsible for the shortcomings in human judgment, with the aim of developing more 
accurate judgment methods. In reviewing these studies, several authors [7-12] draw the 
same three conclusions: 

1. A statistical aggregate of several individual judgments is more accurate than 
the judgment of a random individual. Some authors [10, 12] refine this by noting
that this holds only for tasks of simple to intermediate difficulty. 

2. Judgments resulting from interacting groups are more accurate than a statistically 
aggregated judgment. Also, interaction leads to stronger agreement. 

3. Unstructured, direct interaction still has disadvantages that lead to suboptimal 
accuracy of judgments. 

Starting from these three premises, the most logical next step is to develop judgment 
methods that possess all the advantages, but not the disadvantages, of unstructured, direct 
interaction. Many of these methods have been developed. In most, one tries to overcome 
the disadvantages of unstructured, direct interaction by structuring the interaction between 
participants. The nominal group technique, developed by van de Ven and Delbecq [13], 
is the most widely known structured, direct interaction method. In other procedures, 
interaction is also structured, but no allowance is made for direct interaction. The
best-known structured, indirect interaction method is the Delphi technique, which is the focus
of the present article. 

Based on the above-mentioned literature reviews and the expectations derived from 
them [see, e.g., 14], the different judgment methods can be put on a scale of increasing 
accuracy (Table 1). In this article the expected high relative accuracy of Delphi is eval- 
uated. Both the accuracy of the Delphi method as a whole and the contribution of the 
separate Delphi characteristics to its accuracy will be discussed. In addition, the reliability 
of Delphi and, briefly, its capacity to induce consensus are evaluated. 

History of Delphi 

The first experiment using a Delphi methodology was performed in 1948 to improve 
betting scores at horse races [15]. The name "Delphi" was coined by Kaplan [see 16], 
a philosopher working for the Rand Corporation who headed a research effort directed 
at improving the use of expert predictions in policy-making. Kaplan et al. [17] found 
that unstructured, direct interaction did not lead to more accurate predictions than statistical 
aggregation of individual predictions. To make better use of the potential of group 
interaction, Gordon, Helmer, and Dalkey, also at the Rand Corporation, developed the 
Delphi method in the 1950s. Between 1950 and 1963 they conducted 14 Delphi exper- 
iments, but, as a consequence of its military character, all this work was secret. The first 
article describing some of this research was published in 1963 [18]. In 1964, Gordon 
and Helmer published an article that roused worldwide interest in Delphi [19]. 

Delphi was developed as a method to increase the accuracy of forecasts. Many other 
types of Delphi have been derived from the original method: 

• Delphi to estimate unknown parameters [20] 

• Policy Delphi [21] 

• Decision Delphi [22] 

A Delphi has even been used to compile an epidemiological dictionary [23]. 

In the 1950s and 1960s, Delphi was used mainly to make quantitative assessments 
(forecasting dates and estimating unknown parameters). In the 1970s, emphasis shifted
increasingly to the educational and communicative possibilities of Delphi [24-28],
although Dalkey [29] had mentioned these possibilities in 1967. Some authors [30, 31] 
began to call Delphi a "communication device" and measured its success qualitatively as 
the satisfaction of the participants with the method instead of quantitatively as accuracy 
[32]. The evaluation of Delphi in the present article is restricted to the quantitative Delphi, 
and the conclusions pertain only to this form. 
Characteristics of Delphi

The characteristics of Delphi as it was originally developed are:

Anonymity. Participants, mostly experts, are approached by mail or computer. 

Iteration. There are several rounds. The first round can be an inventory round, in which
participants are asked for events to be forecasted or parameters to be estimated. In 
subsequent rounds, participants are asked to give quantitative estimates about dates of 
future events or values of unknown parameters. The number of rounds is fixed in advance 
or determined according to a criterion of consensus in the group of participants or stability 
in individual judgments. 

Feedback. The results of a first inventory round, if one is held, are clustered and sent
back to all participants. In the first estimation round, participants give their quantitative 
estimates. Before the second and subsequent estimation rounds, the results of the whole 
group on the previous round are fed back in a statistical format (measure of central 
tendency plus variance) to all participants. On the second and subsequent estimation 
rounds, participants making judgments that deviate from the first-round group score 
according to a fixed criterion are asked to give arguments for their deviating estimates. 
Before the third and subsequent estimation rounds, these arguments are, along with the 
statistical results, fed back to all participants. 
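
The statistical feedback step can be sketched in a few lines. The following is a minimal illustration, not the procedure of any particular study: it assumes the median as the measure of central tendency, the interquartile range as the measure of spread, and "outside the interquartile range" as the fixed deviation criterion. The function name and the sample estimates are hypothetical.

```python
from statistics import median, quantiles


def round_feedback(estimates):
    """Summarize one estimation round and flag deviating participants.

    Assumptions (not taken from the article): median as central tendency,
    interquartile range as spread, and "outside the interquartile range"
    as the fixed criterion for asking a participant for arguments.
    """
    values = list(estimates.values())
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    deviating = sorted(
        name for name, value in estimates.items() if value < q1 or value > q3
    )
    return {"median": median(values), "q1": q1, "q3": q3, "deviating": deviating}


# Hypothetical first-round forecasts (year in which an event is expected).
first_round = {"A": 1995, "B": 2000, "C": 2002, "D": 2005, "E": 2030}
feedback = round_feedback(first_round)
print(feedback)
```

In this sketch, participants A and E would be asked to give arguments for their estimates, and the summary statistics would be returned to the whole group before the next round.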

Many variations on this standard method have been used. Delphis with partial 
anonymity have been conducted [22]. The number of rounds varies from two [33] to ten 
[34]. Delphis without a first inventory round are often used to save time. If an inventory 
is necessary, it is done by other means, such as interviewing key persons. Statistical 
feedback in a Delphi can vary from a single number [35] to complete distributions [36]. 
Feedback of arguments is rarely given. 

Evaluation of Delphi 

The accuracy 1 and reliability 2 of a judgment method are difficult to evaluate. The 
reason for this is that judgments cannot be equated to measurements. A measurement 
can be partitioned into a true score and an error component. The error component can 
be regarded as consisting of a number of random variables [37]. These random variables 
tend to cancel each other out in the long run, giving the error component an expectation 
of zero. 

A judgment can also be thought of as consisting of a true score and an error 
component. However, the error component cannot be regarded as consisting of random 
variables. More realistically, the error component in a judgment can be thought of as 
being influenced by person- and situation-specific factors. This means that with a judgment 
method there is ample opportunity for bias, and this bias can vary from application to 
application of the method. As a consequence, every new application of the method can 



1 Accuracy is meant here as the correspondence between the judgment and the true value; in statistical
terms, it is designated as external validity.

2 Reliability is the certainty with which an instrument (for instance, a judgment method) reflects true scores
and not random errors. Reliability indicates the reproducibility of an instrument. High reliability implies that 
measurements (judgments) reflect the true score, but it does not guarantee the true score is correct and does 
not contain systematic error. This is reflected in the common remark that high reliability is a necessary — but 
not sufficient — condition for high accuracy. 
be seen as a new measuring instrument. Evaluation of the accuracy and reliability of
Delphi, being a judgment method, is therefore seriously hampered by the possible influ- 
ence of person- and situation-specific biases. 3 
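
The contrast between a measurement and a judgment can be written compactly. The notation below is an illustrative assumption and does not appear in the cited sources: $x$ is an observed value, $\tau$ the true score, $e$ the random error, and $b_{p,s}$ a bias term that depends on person $p$ and situation $s$.

```latex
\begin{align*}
\text{measurement:}\quad & x = \tau + e, & \mathbb{E}[e] &= 0,\\
\text{judgment:}\quad & x_{p,s} = \tau + b_{p,s} + e, & \mathbb{E}[x_{p,s}] - \tau &= b_{p,s} \neq 0 \text{ in general.}
\end{align*}
```

Under this reading, averaging repeated observations removes $e$ but leaves $b_{p,s}$ untouched, which is why each application of a judgment method behaves like a new instrument.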

Accuracy of the Delphi Method as a Whole 

Because it is difficult to evaluate the accuracy of a judgment method, it is not 
surprising that the accuracy of the Delphi has been falsely inferred from other criteria, 
such as consensus [38], the log-normality of first estimates [39], and the relation between 
remoteness and precision of a forecast [40]. The most feasible way to evaluate the accuracy
of Delphi is to compare it directly to other judgment methods in the same situation. The 
main sources of bias that remain are order effects when the same participants are used 
with the different methods and person-specific effects when different groups of participants 
are used. These effects can be minimized with proper randomization. 

A number of studies have been reported in which accuracy was evaluated by com- 
paring Delphi directly to other methods (see Table 2). In most studies no statistical 
comparison between methods was made. Therefore, the accuracy of the investigated 
methods was rank ordered from most to least accurate. Table 3 lists all 26 possible 
pairwise comparisons between methods of the 17 mentioned studies. A slight, though not
unequivocal, indication of Delphi's expected higher accuracy as compared to unstructured,
direct interaction can be observed. A similarly equivocal suggestion can be found
in Delphi's lower accuracy as compared to the staticized group. The two suggestions
taken together (Delphi being more accurate than unstructured, direct interaction, but less 
accurate than a staticized group) are not easy to interpret and can even be called anomalous. 



3 Note that not only is the evaluation of the accuracy and reliability of the Delphi method hampered by 
person- and situation-specific biases, but also its validation and standardization. 
In general (see the Introduction), it has been found that unstructured, direct interaction
gives more accurate results than does a staticized group. The present suggestions imply 
that a staticized group is more accurate than unstructured, direct interaction. Unfortu- 
nately, very few direct comparisons between unstructured, direct interaction and a sta- 
ticized group were made in the reviewed studies. In three comparisons, a staticized group 
was once more accurate and twice equally accurate, slightly supporting the anomalous 
conclusion. The comparison between Delphi and structured, direct interaction suggests 
that there is no difference in accuracy. Also, the comparisons between Delphi and all 
other methods (meaningful because Delphi has been proposed as the most accurate judg- 
ment method available) show no difference. 

Taken together, the reviewed studies do not offer easily interpretable conclusions 
or unequivocal outcomes of comparisons. A closer scrutiny of the separate studies does 
not offer a clue as to the factors determining these equivocal results. A criticism
pertaining to all the studies is that experiments were conducted in a laboratory, while the 
quiescence of the private environment has been mentioned as a distinct advantage of 
Delphi. Also, in almost no study were expert participants used. Furthermore, in most 
studies no arguments of deviating judges were asked for and fed back to the entire group. 

More specific criticisms can be leveled against the studies, including those seven 
[14, 20, 33, 41-44] that found Delphi to be the most accurate method. The experiments 
Campbell [41] describes in his thesis are often cited as evidence of Delphi's higher 
accuracy as compared to unstructured, direct interaction. Sackman [45], although calling 
Campbell's experiments well conducted, criticizes them because the unstructured, direct 
interacting control group was not really unstructured, but "force-fitted into a Delphi-type 
format." Although Campbell did this to make comparison with Delphi feasible, Sackman 
concludes that it obstructs comparison because the participants in Campbell's interacting 
group did not get enough opportunity for spontaneous interaction, and therefore this group 
has to be regarded as seriously disadvantaged in comparison to Delphi. 

Pfeiffer [43] reported Delphi to be more accurate than what seems to be a staticized 
group response for 13 of 16 short-term economic indicators. This result was not obtained 
by Pfeiffer himself. In his book, Pfeiffer describes these results in two short paragraphs. 
He only writes that the experiments were conducted at the University of California in 
Los Angeles. He seems to refer to the experiments performed by Campbell. But Camp- 
bell's research concerned the comparison between Delphi and unstructured, direct inter- 
action. It is not clear whether Pfeiffer is wrongly referring to this research or rightly 
referring to a separate part of this research that has not been described elsewhere. 

The experiments by Dalkey [20, 46, 47] probably most strongly contributed to the 
view that the accuracy of judgments can be increased by using the Delphi method. In a 
series of 11 experiments, Dalkey asked students to answer almanac-type questions, such 
as "In what year was nylon invented?" and "What was the circulation of Playboy magazine
as of January 1, 1966?" Dalkey's experiments are strongly criticized for several reasons. 
First, it should be noted that the judgments in Dalkey's experiments did not always 
become more accurate over rounds. Improvement occurred in two-thirds of the questions, 
and deterioration was found in the remaining one-third. The tasks Dalkey used were very 
simple, and some authors [48] considered them unrepresentative of most judgmental 
problems encountered in real life. Also, the nonparametric statistical tests Dalkey used 
to evaluate his results are criticized for their simplicity [49]. Dalkey only recorded whether
improvement over rounds did or did not occur, independent of the absolute change in 
accuracy. As a result, very small changes in accuracy can attain much weight. A striking 
example of this is found in a study performed by Brown and Helmer [50], using Dalkey's 
method of analysis. Here, increased accuracy over rounds was also found for two-thirds 
(13 of 20) and decreased accuracy for one-third (7 of 20) of the questions. Of the 13 
questions improving in accuracy over rounds, the improvement was 0-0.1 standard scores
for five questions, 0.1-0.2 standard scores for four questions, 0.2-0.3 standard scores
for three questions, and almost 0.5 standard score for only one question. That these 
increases are very modest can be seen in the comparison of last-round Delphi scores with 
the first-round scores of a random individual judge, which can be regarded as an absolute 
minimal baseline. Delphi scores were more accurate on a median of 12.5 of 20 questions, 
while the random judge was more accurate on the remaining 7.5 questions. An even less 
favorable picture appears when interquartile distances are considered. In the first round, 
13 of the 20 interquartile distances covered the true value. Because of convergence, only 
seven values fell in the interquartile distance on the last round. 
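
The modest size of these improvements can be made concrete with a short calculation. The sketch below uses only the bin counts quoted above and takes the upper edge of each bin (0.5 for the single largest case), so it yields an upper bound on the mean improvement of the 13 improving questions. The variable names are illustrative.

```python
# (upper edge of improvement bin in standard scores, number of questions)
improvement_bins = [(0.1, 5), (0.2, 4), (0.3, 3), (0.5, 1)]

n_improved = sum(count for _, count in improvement_bins)
upper_bound_total = sum(edge * count for edge, count in improvement_bins)
upper_bound_mean = upper_bound_total / n_improved

# A sign-only tally reports the share of improving questions (13 of 20)
# and says nothing about how large each improvement was.
share_improved = n_improved / 20

print(n_improved, round(upper_bound_mean, 3), share_improved)
```

Under these assumptions, the mean gain among the improving questions is at most about 0.21 standard scores, a magnitude that a tally of improved versus deteriorated questions conceals.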

An unpublished report by Sack [44] is cited by Riggs [33] in two sentences. The 
only information given is that Delphi was more accurate in short-term forecasting than 
an unstructured, direct interacting group, but not at the stated level of statistical signif- 
icance. Riggs's own study [33], in which Delphi was found to be more accurate than 
unstructured, direct interaction, made use of a modified Delphi. Participants were not 
anonymous and only two rounds were run. 

In the study by Parente et al. [42], Delphi was more accurate than a staticized group, 
but the comparison between Delphi and the staticized group was done with different 
groups of participants in two different situations. First, for ten events, a group of 300 
students forecasted if and (if so) when the events would occur during the next months. 
Three groups of 100 students each made forecasts weekly during a time span of one, 
two, or three months. A staticized group result was calculated based on two events that 
occurred in the designated time period. Following this, a Delphi experiment was con- 
ducted. This time, 80 new students made forecasts about 30 new events. Again, students 
had to make a forecast every week, all for a time span of two months. Of the 30 events, 
six occurred in these two months and these six events were used in the analysis. It is 
clear that the different groups of participants and events, as well as the different number 
of participants and events, seriously hinder the comparison between the staticized group
and the Delphi. 

The study by Erffmeyer and Lane [14] compared four procedures using the NASA 
"lost on the moon" exercise: (a) unstructured, direct interaction, (b) consensus group, 
(c) structured, direct interaction, and (d) Delphi. Participants in the consensus group 
engaged in unstructured direct interaction, but followed guidelines to resolve conflicts. 
Participants had to rank 15 items of equipment in terms of importance for the survival 
of a shipwrecked crew on the moon. The results of four groups, each using one of the 
four different procedures, were compared to the "correct" rank order assigned by NASA 
experts. The Delphi group was the most accurate, followed by the consensus group and
the unstructured, direct interacting group; the structured, direct interacting group was the 
least accurate. These results can, of course, be questioned because the correct rank order 
is unknown. There is no way to tell whether the NASA experts' rank order is correct. It 
would be interesting to know if the experts themselves would rank order the items 
identically under the four different procedures. 

The above-mentioned criticisms of studies reporting Delphi to be more accurate than 
one or more other judgment procedures can be and have been used against Delphi. The 
seven studies [48, 49, 51-55] in which Delphi was found to be less accurate than at least 
one other method can, however, likewise be criticized. 

In the study by Farquhar [51], Delphi was found to be less accurate than unstructured, 
direct interaction in two separate experiments. But the interpretation of these results is 
hampered by the small and unequal numbers of participants. The Delphi groups had nine
and four participants, and the unstructured, direct interacting groups had five and three. 
Group size can have a profound effect on the accuracy of judgments [20, 56]. 

Gustafson et al. [52] compared Delphi with a staticized group, with unstructured, 
direct interaction, and with structured, direct interaction. Delphi was the least accurate 
method. These results have been questioned on two grounds. Fischer [57] contends that 
the dependent variable used by Gustafson et al. exaggerates small differences, and he 
reanalyzes the results with another dependent variable. The outcome of this analysis is 
that no differences between methods are found. Van de Ven and Delbecq [58] ascribe 
Gustafson et al.'s negative findings with Delphi to the specific variant of Delphi used. 
This was not a real Delphi, but a derived method called "estimate-feedback-estimate." 
Small groups of four participants sat together and received written feedback. The un- 
naturalness of written feedback in a gathered group with the prohibition of direct contact 
could, according to van de Ven and Delbecq, have induced negative social affiliation 
with detrimental effect on the results. 

In the study by Ford [49], a staticized group was more accurate than two versions 
of Delphi. But the absolute differences in accuracy were small. The only striking result 
was that over four rounds Delphi estimates became slightly less accurate and staticized 
group estimates slightly more accurate. Ford's study can be criticized for his use of a 
within-subjects design in which four groups of subjects all participated in four different 
judgment methods. Although Ford neatly randomized the order in which the four ex- 
perimental groups participated in the four methods by making use of a Latin square, his 
results do suggest order effects. Randomizing order allows for testing order effects, but 
it is no guarantee that order effects will not occur. For instance, having a Delphi (in 
which feedback is provided) before a staticized group method (iteration of individual 
judgments in which feedback is not provided) cannot be expected to be counterbalanced 
by giving another group of subjects the staticized group method before the Delphi. 

The study by Seaver [55] compared five methods (no interaction; unstructured, direct 
interaction; two methods using structured, direct interaction; and Delphi) under six ag- 
gregation procedures. Averaged over all aggregation procedures, all kinds of interaction 
tended to decrease the accuracy of probabilistic answers to two-alternative general knowl- 
edge questions. The design and methodology of the study were quite complex. Ten groups 
of four persons each answered questions before and after interaction (except in the no- 
interaction condition). Five sets of questions and the five procedures were balanced using 
a Greco-Latin square design. Just as in the study by Ford [49], the randomization of 
order does not imply that order effects do not occur. It cannot be determined whether 
Seaver's data, like those of Ford, suggest order effects, because Seaver's paper does not 
give the worked-out order of procedures. In sharp contrast to the statistical sophistication
of the study, the interaction procedures Seaver used were of rudimentary form. Delphi 
was barely Delphi-like. Only two rounds were used. Before the second round the ex- 
perimenter personally informed each group member about the other members' assessed 
probabilities and the reasons for them. Therefore, individual scores, but not a group 
score, were fed back to all participants. 

The study by Miner [53] reports Delphi to be significantly less accurate than a 
structured, direct interaction method that was the focus of the article. But there was no 
difference between Delphi and the most widely known structured, direct interaction 
method, the nominal group technique. Erffmeyer and Lane [14] criticize the type of 
Delphi used by Miner, primarily for the lack of physical separation of participants. 

Moskowitz and Bajgier [54] found Delphi to be less accurate than unstructured, 
direct interaction. The comparison of both methods was, however, not the primary object 
of the study. Little attention was paid to the characteristics of the Delphi. Participants 
were not anonymous and feedback of individual scores instead of group scores was given. 

In the study by Boje and Murnighan [48], Delphi was more accurate than unstruc- 
tured, direct interaction, but less accurate than a staticized group. However, the difference 
between Delphi and the staticized group was small. A confounding factor in this study 
was the big difference in the first-round estimates between the groups. They were most 
accurate for the Delphi and subsequently became less accurate over rounds. The unstruc- 
tured, direct interacting group was least accurate on the first round and remained so. The 
accuracy of the staticized group was between those of the other two groups, but subse- 
quently improved over rounds. 

Three studies did not find a difference between Delphi and one or more other methods. 
These studies can also be criticized. In the study by Brockhoff [59], no fewer than seven
hypotheses were tested at the same time, of which the difference between Delphi and 
unstructured, direct interaction was only one. Not surprisingly, a confounded result was 
found. Delphi was slightly more accurate with respect to forecasts and slightly less accurate 
with respect to fact-finding questions. As the author himself remarks, it cannot be excluded 
that still other factors — for instance, group size — influenced these results. 

In the study by Rohrbaugh [12], both Delphi estimates and those produced in a 
structured, direct interaction method became more accurate over rounds, but only the latter
increase was significant. Rohrbaugh found no difference in last-round accuracy between
the two groups; this could be the result of the lower accuracy of first-round estimates in the
structured, direct interacting group. A notable feature of this experiment was that par- 
ticipants were not instructed to give their most accurate estimates, but to "reduce the 
existing differences within their group." The influence of these instructions on the end 
result cannot be ascertained, but setting a goal other than optimal accuracy can hardly 
be expected to optimize accuracy. 

In the study by Fischer [57], all four methods mentioned in Table 2 were compared. 
Amazingly, no difference between any pair of methods was found. Seaver [55] partly 
holds Fischer's method of analysis responsible for this. The proper scoring rules Fischer 
used are, according to Seaver, relatively insensitive to differences between estimates. 
The Delphi variant used could be called atypical. There were eight groups of only three 
persons each. Only two rounds were run. The minimal requirements for sharing infor- 
mation in a Delphi do not seem to be met in such small groups with very little opportunity 
for feedback. 

Scrutiny of the separate studies does not lead to more easily interpretable conclusions 
or clues for the reasons why no comparison between methods gives an unequivocal result.
The only justified conclusion seems to be that factors other than the specific method used
(capability of the group leader, motivation of the participants, quality of the instructions, 
etc.) to a large extent determine the accuracy of an application of a judgment method. 
In accordance with this, one of the most consistent findings is that the method which was 
the primary focus of an article, and which can be expected to be preferred by the author(s),
was almost always found to rank highest in accuracy [14, 20, 33, 41, 43, 49, 52, 53]. 
The only two exceptions are the studies by Brockhoff [59] and Rohrbaugh [12]. Brockhoff 
was unable to show Delphi's greater accuracy as compared to unstructured, direct inter- 
action, and Rohrbaugh could not show his self-developed structured, direct interaction 
method (social judgment analysis) to be more accurate than Delphi. Both authors try to 
find methodological explanations for this and also mention additional advantages of their 
preferred method. The predominantly positive results with the method of preference suggest
that the unidentified factors in judgment methods responsible for high accuracy are best 
taken care of when the investigator has confidence in the method being used. For this 
reason, it is very difficult to evaluate the greater accuracy — as compared to Delphi — of 
a few highly idiosyncratic methods, tested only once by their developer and main pro- 
ponent [10, 49, 52, 60, 61]. 

Contribution of the Delphi Characteristics to Its Accuracy 

ANONYMITY 

Anonymity in a Delphi is meant to exclude group interaction processes that decrease 
the accuracy of group judgment, while preserving positive influences [56, 62]. Anonymity 
has been criticized [63] for its intrinsic negative effects (lack of a feeling of responsibility 
for the end result) and because it would obviate some intrinsic positive effects of un- 
structured, direct group interaction (flexibility and the richness of nonverbal communi- 
cation). A conclusion regarding the (dis)advantages of anonymity is difficult to draw. 
The only data giving a rough indication concern the satisfaction of participants with the 
Delphi method. This satisfaction varies strongly [48, 53, 58, 61, 64]. A possible problem 
with anonymity is low compliance. In one study [22], partial anonymity led to a higher 
response. 

USE AND SELECTION OF EXPERTS 

It seems obvious to use experts in situations of high uncertainty. According to many 
authors [45, 65-68], however, the lack of directly relevant information in uncertain 
situations determines judgments more than the information that is available, and conse- 
quently experts are not more accurate than nonexperts. One author [28] even contends 
experts perform worse than nonexperts, because the former are more strongly influenced 
by the desirability of an answer. Within the context of Delphi, expertise has more 
specifically been investigated by means of the relation between accuracy and self-ratings. 
Negative [42, 48, 65, 67-70] as well as positive [35, 47, 50, 71, 72] correlations have 
been found. Dalkey [72] tried to reconcile these contradictory results by positing that no
substantial correlation exists, either positive or negative, for individuals, but that a positive
correlation holds for groups of approximately seven or more persons. Dalkey even argues that
selection of a subgroup with high self-ratings leads to the same increase in accuracy as 
he finds with the Delphi method. He proposes a combination of both: subgroups for 
questions having enough participants with high self-ratings and a Delphi for the remaining 
questions. A disadvantage of selecting a subgroup is that group size is reduced, with the 
possibility of a decrease in accuracy due to a larger random error. Research with other 
judgment methods has shown that selection of a subgroup increases accuracy only if the 
group shows large variability in accuracy, if the probability of making a wrong judgment
is great, and if the most accurate judges can be selected with high certainty [73-75].
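
The trade-off between selecting better judges and shrinking the group can be illustrated with the standard result for the mean of independent errors. This is a textbook idealization offered as an assumption, not a model taken from the cited studies: each of $n$ judges has an error $e_i$ with mean zero and common variance $\sigma^2$, and the errors are independent.

```latex
\[
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right) = \frac{\sigma^2}{n},
\qquad
\text{so the random error of the group mean scales as } \frac{\sigma}{\sqrt{n}}.
\]
```

Under these assumptions, halving the group raises the random error of the group mean by a factor of about 1.4 unless the selected judges have a correspondingly smaller $\sigma$.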

ITERATION 

The original purpose of iteration in Delphi is to have only the least-informed par- 
ticipants change their minds over rounds [56, 62]. Critics [45] assert iteration only leads 
to boredom. In a great many studies, a slight improvement in accuracy over rounds is 
found (for review, see Armstrong [7], Dietz [69], and Erffmeyer et al. [76]), although 
there are also studies finding no improvement [48, 52, 60]. In nearly all studies finding 
improvement, most or all improvement takes place between the first and second estimation 
rounds [7, 10, 20, 28, 69, 77]. In a few studies accuracy further increased after the 
second estimation round [60, 76]. Iteration cannot be considered independent from feed- 
back in a Delphi. Only a few studies investigated the effects of iteration and feedback 
on accuracy independently from each other. In three studies [42, 48, 49], a slight increase 
in accuracy was caused completely by iteration, and in only one [47] by feedback. 

FEEDBACK 

The idea behind providing feedback is to share the total information available to a 
group of individual experts. Those experts who find the composite judgment of the group 
or the arguments of deviating experts more compelling than their own should subsequently 
modify their judgment [56, 62]. In this way, group pressure to conformity is prevented 
and any change in judgment is caused by new information only. Feedback in a Delphi 
can consist of the statistical summary of the group response as well as arguments of 
deviating judges. There is limited evidence that arguments of deviating judges lead to 
an increase in accuracy. Three studies [61, 71, 78] report a slight increase in accuracy 
due to arguments over the increase caused by statistical feedback. In one study [20] 
feedback of arguments decreased accuracy. In nearly all studies, the largest increase in 
accuracy in a Delphi is found between the first and second estimation rounds. Only the 
statistical group response has then been fed back. 

A host of support can be found for the assertion that statistical feedback induces 
conformity. Numerous studies [12, 20, 28, 38, 49, 64, 79-83] show that changes over 
rounds are in the direction of the group response that has been fed back. For most of 
these studies, Ford's [49] remark holds: "Delphi methods induce change toward the 
median, but rarely cause the median to change." Change toward the fed-back value also 
occurs when this value is false [33, 80, 83, 84]. In two studies [20, 35] a "pull to the 
mean" has been compared directly to a "pull to the true." By far the greatest increase in 
accuracy was caused by the pull to the mean. In these two studies and in a later one [77], 
a relation was found between the distance of an individual's judgment to the group response 
and the percentage of judges subsequently changing their estimate (see Figure 1). This 
relation suggests an explanation for the slight increase in accuracy over rounds found in 
many Delphis. First-round estimates have been found to be distributed lognormally (see 
Figure 2). In such a positively skewed distribution, the mode lies below the median and 
the mean. Consequently, the distance to the median is on average smaller for values to the 
left than for values to the right of the median. Two processes may cause a subsequent 
change toward the median: 

1. The relation between distance to the group response and subsequent change is 
asymptotic. Because values to the left are on average closer to the median than 
those to the right, newly estimated values will come even closer to the left of 
the old median than those to the right. 

Fig. 1. Proportion of participants changing opinion versus deviation from the median. 
From Dalkey [20]. 

2. The relation between distance to the group response and subsequent change is 
not perfect. Also, values on and close to the median will be changed. Because 
values to the left of the median are on the average closer and, upon new estimation, 
come even closer to the median than the values to the right, there will be more 
crossings from left to right than from right to left. 

The net result is a small increase in the median. If the initial group response underestimates 
the true value, the result is a slight increase in accuracy. If the initial group response 
overestimates the true value, the result is a slight decrease in accuracy. Support for this 
has been found in several studies [17, 78, 85]. 
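
The asymmetry on which this argument rests can be checked numerically. The following is a minimal sketch in Python, assuming that first-round estimates follow a lognormal distribution with the hypothetical parameters mu = 0 and sigma = 1 (neither value is taken from the studies reviewed). It illustrates only the premise that estimates below the median lie, on average, closer to it than estimates above it. It does not model the change process itself.

```python
import math
import random
import statistics

# Hypothetical parameters of the lognormal distribution (illustrative only).
MU, SIGMA = 0.0, 1.0

# Analytic location measures of a lognormal distribution.
mode = math.exp(MU - SIGMA ** 2)
median = math.exp(MU)
mean = math.exp(MU + SIGMA ** 2 / 2)

def mean_distances_to_median(n=10_000, seed=1):
    """Draw n simulated first-round estimates and return the average
    distance to the sample median for values below and above it."""
    rng = random.Random(seed)
    estimates = [rng.lognormvariate(MU, SIGMA) for _ in range(n)]
    m = statistics.median(estimates)
    below = [m - x for x in estimates if x < m]
    above = [x - m for x in estimates if x > m]
    return statistics.fmean(below), statistics.fmean(above)

left, right = mean_distances_to_median()
print(f"mode {mode:.2f} < median {median:.2f} < mean {mean:.2f}")
print(f"mean distance below median: {left:.2f}, above: {right:.2f}")
```

Under these assumptions the values below the median are bounded by zero, whereas the values above it form a long tail, which is why the average distance on the left is the smaller one.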

A definite conclusion regarding feedback in a Delphi is now possible. The small 
increase in accuracy over rounds found in many Delphis can be ascribed partly to the 
mere iteration of judgment (see above) and partly to an artifactual by-product of the 
pressure to conformity caused by the statistical feedback. In any case, changes in estimates 
caused by feedback, whether or not associated with an increase in accuracy, are not 
induced by dissemination of information to all participants, but are the result of group 
pressure to conformity. This is supported by the finding that feedback of arguments, 
which can especially be thought to disseminate information, has only a negligible effect 
on the accuracy of estimates. 

Fig. 2. Distribution of first-round estimates. From Dalkey [20]. 

TABLE 4 

Reliability of Delphi in Studies Conducting Two or More Delphis with the Same Questions 

Study                        Reliability 

Helmer [86] 
Bender et al. [88] 
Ament [87] 
Dalkey [20] 
Dalkey et al. [72] 
Sahr [89] 
Welty [90] 
Welty [67] 
Amara and Salancik [93] 
Grabbe and Pyke [94] 
Martino [see 45] 
Huckfeldt and Judd [95] 
Sackman [45] 
Dagenais [96] 

a Pearson product-moment correlation calculated by the present author. 
b Unspecified correlation coefficient reported by the original author. 
c Rank-order coefficient reported by the original author. 
d Kappa reliabilities reported by the original author. 

Reliability 

Because of person- and situation-specific factors, as described in the section on 
evaluation of Delphi, it would appear difficult to standardize Delphi and to evaluate its 
reliability. Reliability has exclusively been evaluated by comparing results of two groups 
of participants in the same Delphi. Such a procedure is acceptable if it is assumed that 
both groups have the same bias. With proper randomization, this assumption does not 
appear unreasonable. Unfortunately, the studies evaluating the reliability of Delphi suffer 
more serious shortcomings. A drawback of many studies is that comparisons were made 
between groups whose forecasts were obtained in different years. Experiments were from 
one [86] to eight [87] years apart. Clearly, forecasting the date of a future event in 1964 
is not the same as forecasting that date in 1972, when circumstances will have changed 
considerably. As with accuracy, many studies used no statistical criterion. Because re- 
liability can only be inferred from absolute values and not from comparisons to other 
methods, as with accuracy, there is a serious shortcoming here. Therefore, if raw scores 
were reported, Pearson product-moment correlation coefficients were calculated. An 
additional shortcoming of many studies is that no variances were indicated, making it 
unclear whether the intuitively judged high reliability was caused by genuine similarity 
or by overlap of unduly wide confidence intervals. An overview of studies assessing the 
reliability of Delphi is given in Table 4. They all report a relatively high reliability. 
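
The recalculation described above can be sketched as follows. This is a minimal Python illustration of a Pearson product-moment correlation between the median forecasts of two groups. The function name `pearson_r` and both lists of forecast years are hypothetical and are not taken from any of the studies in Table 4.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    cross = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    ss_x = sum(dx * dx for dx in dev_x)
    ss_y = sum(dy * dy for dy in dev_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical median forecast years from two groups answering the same
# seven questions (invented for illustration, not taken from any study).
group_a = [1975, 1980, 1985, 1990, 2000, 2010, 2020]
group_b = [1977, 1978, 1990, 1988, 2005, 2004, 2030]

r = pearson_r(group_a, group_b)
print(f"r = {r:.2f}")
```

A coefficient computed on medians alone says nothing about the spread of the individual forecasts, which is why the absence of reported variances remains a limitation.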

Helmer [86] puts median results of three experiments next to each other. Variances 
were not considered. One experiment was conducted in 1963 with a traditional Delphi 
method [19]. Another was a pretest conducted in 1966 with 23 Rand Corporation em- 
ployees. The last experiment was done at a conference using 100 conference delegates 
(not physically separated from one another), of which 23 were selected to finish the exercise. 
The article gives no indication as to how this selection was performed. The 1966 pretest 
and 1967 conference Delphi could be compared on seven forecasts. One event was 
forecasted to occur in the same year by both groups. The other six forecasts differed by 
2-9 years. The correlation coefficient calculated from the raw data presented by the author 
was 0.87. Conference delegates remarked that serious biases existed in the wording of 
the questions, and they did not expect their forecasts to be particularly "reliable." The 
1963 study contained four questions related — but not identical — to those in the other two 
studies. Differences ranged from one to 21 years. Because of the lack of similarity and 
the small number of questions, computation of coefficients is pointless. 

In the study by Bender et al. [88], three experiments were compared, also only with 
respect to median values. One was the above-mentioned study by Gordon and Helmer 
[19]. The other two were a pretest and the experiment proper, both conducted by the 
authors. The three studies could be compared on 11 forecasts. Correlation coefficients 
for the three pairwise comparisons, calculated from the raw data and excluding questions 
with imprecise answers (never, >x), were -0.06, 0.30, and 0.73. 

In the study by Ament [87], two experiments, one the study conducted by Gordon 
and Helmer [19] and the other by the author in 1969, were compared on 31 forecasts. 
Medians as well as interquartile distances were given. The phrasing of the questions was 
not the same in both studies. For 13 questions, a possible bias in phrasing was indicated 
by the author. Limiting analysis to the 18 questions for which a possible bias was not 
indicated, interquartile distances overlapped for 15. Although this number seems very 
large, interquartile distances were also large. Excluding indefinite values, means of the 
interquartile distances were 22.2 years for the 1963 experiment and 13.8 years for the 
1969 experiment. Corresponding means of years to forecasted occurrence were 23.7 and 
20, respectively. The correlation coefficient calculated from 15 median values for which 
precise values in both studies were given was 0.57. 

Dalkey [20] reports correlation coefficients on the basis of the median responses to 
20 almanac-type questions for seven pairs of groups of different sizes. The coefficients 
increased monotonically with group size, from almost 0.4 for groups of two respondents 
to 0.75 for groups of 11 respondents. 

Dalkey et al. [72] continued the above-mentioned study with an experiment involving 
eight pairs of groups of 15-20 respondents. Coefficients were calculated from first-round 
estimates. First-round scores in a Delphi are in fact staticized group responses. Dalkey 
et al.'s experiment therefore does not refer to Delphi at all. Another criticism of this 
experiment is that estimates were not used directly. Because correct answers were avail- 
able, group errors were calculated by taking the natural logarithm of the median estimate 
of the group divided by the true answer. Dalkey's method of analysis suggests that it is 
instructive to look at the relation between accuracy and reliability. Textbooks often remark 
that reliability is a necessary condition for accuracy. But the reverse relation is also 
interesting. High accuracy is a sufficient condition for high reliability. Very easy questions, 
having high accuracy, automatically lead to high reliability. This high reliability could 
be called spurious. Although Dalkey did not check for this possibility, the high coefficients 
found in his study could merely indicate that he asked simple questions. 
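
The error measure described above can be written out as a short Python sketch. The function name `log_group_error` and the estimates are hypothetical. Only the formula, the natural logarithm of the group median divided by the true answer, comes from the description of Dalkey et al.'s analysis.

```python
import math
import statistics

def log_group_error(estimates, true_value):
    """Natural logarithm of (group median / true answer).
    Zero means the median is exactly right; the sign shows the
    direction of the error."""
    return math.log(statistics.median(estimates) / true_value)

# Hypothetical estimates for one almanac-type question.
estimates = [120, 150, 200, 260, 400]

# The median (200) equals the true value, so the error is 0.0.
print(log_group_error(estimates, true_value=200))

# The median is half the true value, so the error is ln(0.5).
print(log_group_error(estimates, true_value=400))
```

Because the measure depends on the true answer, two groups that both answer an easy question well will both score close to zero, which is the source of the possibly spurious reliability discussed above.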

Sahr [89] compared three Delphi studies, but not directly with respect to reliability. 
Sackman [45] criticizes this study for its presentation of "some fifty pages filled with 
descriptive quantitative data," without reporting "a single statistic indicating variances, 
standard errors of measurement or product-moment reliability coefficients." 

The two studies by Welty [67, 90] report intuitively high reliability. However, these 
studies were meant to discredit Delphi. Of the two groups compared, one consisted of 
experts and the other of laypeople. By showing that both groups performed equally well, 
Welty intended to prove that Delphi's strong reliance on experts is misplaced. The two 
experiments Welty performed were partial replications with lay participants of a study 
by Rescher [91, 92] with experts. Participants had to give estimates on a five-point scale 
about the importance of a number of American cultural values (14 in Welty's first study 
and 17 in the second) in the year 2000. The topic makes one wonder who is to be counted 
as an expert and who is not. For the first study [90], Welty only reports the results of 
an overall sign test. This test showed that the two groups were not significantly different 
(p = .18). In the second study [67], an overall value could not be computed. Of the 16 questions 
that could be compared, only two differed significantly according to an F-test. 
However, error variances were quite high. The mean of the 16 standard deviations was 
0.84 for the experts and 1.13 for the lay judges. Corresponding 99% confidence intervals 
that go along with a one-tailed test at the 1% level (used by Welty) are 1.96 and 2.63, 
respectively. This leaves very little room for obtaining a difference between two groups 
making judgments on a five-point scale, where, in fact, 31 of 32 group scores fell between 
points two and four. The correlation coefficient based on 15 of the 16 pairs of means, 
for which values in both groups were given, was 0.82. 
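
The two interval values quoted above appear to follow from multiplying each mean standard deviation by the one-tailed 1% critical value of the standard normal distribution. This reading is an assumption about how the figures were obtained and is not stated in the studies themselves. A minimal Python sketch of that arithmetic:

```python
from statistics import NormalDist

# One-tailed critical value of the standard normal distribution at the
# 1% level, rounded to two decimals (an assumption about the calculation).
z = round(NormalDist().inv_cdf(0.99), 2)

# Mean standard deviations as given in the text.
mean_sd = {"experts": 0.84, "lay judges": 1.13}

# Interval size per group on the five-point scale.
half_widths = {group: round(z * sd, 2) for group, sd in mean_sd.items()}
print(z, half_widths)
```

Under this assumption the intervals span roughly two to two and a half scale points, which shows how little room a five-point scale leaves for detecting a difference between groups.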

Amara and Salancik [93] also used an ordinal five-point scale on which two groups 
of participants had to rate the likelihood of 89 social developments in the 1980s. A 
correlation coefficient of 0.77 was found. The same criticisms as those leveled against 
Welty's study are valid. Most participants gave answers in the middle range of the scale, 
leaving very little room for a difference of more than one scale point. 

In the study by Grabbe and Pyke [94], six experiments, conducted from 1964 through 
1972, were compared on median forecasts. From the six studies, 15 pairwise comparisons 
can be extracted. The number of more-or-less comparable questions between studies 
ranged from zero to eight, with a mean of 2.6. Only eight comparisons involved identical 
questions: three concerning three questions, two concerning two questions, and three 
concerning one question. The small number of identical questions makes it impossible 
to calculate correlation coefficients for any pair of studies; therefore, nothing can be said 
quantitatively about the claim of the authors that "there appears to be reasonable good 
agreement among the dates." 
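
Why so few shared questions rule out a meaningful coefficient can be shown with a short Python sketch. The study labels and forecast years below are invented for illustration. With only two shared questions, the Pearson coefficient is necessarily +1 or -1 whatever the size of the disagreement, and with a single question it is undefined.

```python
import math
from itertools import combinations

# Placeholder labels for the six experiments (the labels are hypothetical).
studies = ["A", "B", "C", "D", "E", "F"]
pairs = list(combinations(studies, 2))
print(len(pairs))  # 15 pairwise comparisons

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented median forecast years for two shared questions.
# Same ordering in both studies, although the dates differ widely.
r_same_order = pearson_r([1980, 1995], [1990, 1991])
# Opposite ordering in the two studies.
r_reversed = pearson_r([1980, 1995], [2010, 1985])
print(r_same_order, r_reversed)
```

In both cases the coefficient reflects only the ordering of the two forecasts, not how close the dates are.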

An attempt to demonstrate the reliability of Delphi by Martino is mentioned and at 
the same time criticized by Sackman [45]. In a number of independent studies Martino 
noticed several similar questions that resulted in similar forecasts. No statistical indexes 
were reported. 

Huckfeldt and Judd [95] compared the responses of five respondents answering two 
questions twice, separated by two weeks. Respondents had to indicate on a seven-point 
scale the likelihood of the occurrence of an event. The authors indicate only the percentage 
of responses differing one, two, three, or more scale points on the two occasions. It can 
be read from their data that 23% of the changes were of more than one scale point and only 
10% of more than two points. No rank correlation is reported and, because no raw 
data are given, it could not be calculated. 

Along the lines followed by Welty [67, 90], also with the aim of disproving Delphi's 
strong reliance on experts, Sackman [45] replicated the much-referred-to Gordon and 
Helmer study [19] several times with students. Spearman rank coefficients were calculated 
for the order of years in which events were forecasted. The average rank correlation 
between the several replications was 0.77. 

Dagenais [96] calculated kappa reliabilities for 16 pairs of groups of students asked 
to identify successful vocational education programs from a list of programs. Kappa 
values varied greatly. As with Dalkey's data, a relation with accuracy was not investigated, 
but could be expected. The highest kappa value Dagenais reports is 1, a value not to be 
expected with even the most sophisticated judgment method if something more compli- 
cated than a simple arithmetic question is asked. 
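
To make clear what a kappa of 1 implies, the following Python sketch computes Cohen's kappa for two sets of classifications. The choice of Cohen's two-rater kappa is an assumption, because the text does not specify which variant Dagenais used. The function name `cohens_kappa` and the classifications are hypothetical.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two sets of categorical ratings of the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Proportion of items on which the two sets of ratings agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from the marginal category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both use a single category")
    return (observed - expected) / (1 - expected)

# Hypothetical classifications of eight programs (1 = successful, 0 = not).
group_1 = [1, 1, 0, 0, 1, 0, 1, 0]
group_2 = [1, 1, 0, 0, 1, 0, 1, 0]  # identical to group_1
group_3 = [1, 0, 0, 1, 1, 0, 0, 0]  # differs from group_1 on three programs

print(cohens_kappa(group_1, group_2))  # complete agreement gives kappa = 1
print(cohens_kappa(group_1, group_3))
```

A kappa of 1 therefore requires the two groups to classify every single program identically.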

The correlations reported in or derived from the above-mentioned studies range from 
very low to very high. In some cases, the intuitively found high reliability could not be 
substantiated quantitatively. In addition to the great variation in correlations and the inability 
to quantify some results, the discussed data show many limitations and shortcomings 
(relation to accuracy, bias in phrasing of questions, different years of obtaining forecast, 
large or unindicated error variance). A methodology to evaluate the reliability of Delphi 
more thoroughly has been suggested by Hill and Fowles [65]. They suggest reliability 
should be ascertained in several ways, including test-retest evaluation and investigation 
of the effects of procedural variations on the results. Their proposal has not been adopted. 
A definite conclusion regarding reliability of the Delphi method must therefore be postponed. 
The present data, however, do suggest that the Delphi method can hardly be expected 
to be reliable. Because of person- and situation-specific biases, a new 
measuring instrument is created with every new application of the method. These person- 
and situation-specific biases inevitably arise as every new set of questions is accompanied 
by a new group of experts giving judgments. Person-specific biases can only partly be 
removed by random selection of participants to groups. Differences between groups can 
also occur by selective attrition of participants during execution of a Delphi. Attrition in 
Delphis is very variable [45, 53, 58, 80, 95, 97]. Some data suggest attrition in a Delphi 
can be selective. It has, for instance, been found that dropouts are further removed from 
the group median on the first round than holdouts [77], implying that deviating participants 
drop out more often. 

Consensus 

In the introduction it was mentioned that in addition to the general view that inter- 
acting groups give more accurate judgments than a staticized group, interacting groups 
also show stronger agreement. A Delphi is extremely efficient in achieving consensus 
[12, 20, 28, 38, 49, 64, 79, 81-83]. Several studies report stronger consensus with a 
Delphi than with unstructured, direct interaction [33, 57]. As with accuracy, consensus is 
almost always at its maximum after the second estimation round [16, 18, 80, 95]. Although 
consensus can be important, it can never be the primary goal of a Delphi. High consensus 
is neither a necessary nor a sufficient condition for high accuracy. In most Delphis a 
slight increase in accuracy over rounds is found (see the section on iteration). In contrast, 
consensus increases very strongly. Also, a direct comparison shows that consensus 
increases much more strongly than accuracy [20]. Together with the indications that group 
pressure to conformity is very strong in a Delphi, this makes consensus in a Delphi 
suspect and in no way related to genuine agreement. Consensus in a Delphi is therefore 
not a good criterion, not even as a secondary one. Most probably, in situations of 
uncertainty lack of consensus is inevitable. As Stewart and Glantz [98] remark, "The 
same lack of knowledge that produced the need for a study that relied on expert judgment 
virtually assures that a group of 'diverse experts' will disagree." 
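
That consensus can rise without any gain in accuracy can be illustrated with a deliberately simplified Python sketch. It assumes that every participant moves the same fraction of the way toward the fed-back median, which is cruder than the distance-dependent change reported in the studies cited. The function names, the estimates, and the true value of 100 are all hypothetical.

```python
import statistics

def pull_toward_median(estimates, strength):
    """Move every estimate a fixed fraction of the way toward the group
    median. strength = 0 leaves the estimates unchanged; strength = 1
    collapses them onto the median."""
    m = statistics.median(estimates)
    return [x + strength * (m - x) for x in estimates]

def interquartile_range(values):
    """Difference between the third and first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical first-round estimates; the true answer is assumed to be 100.
TRUE_VALUE = 100
round_1 = [20, 35, 40, 50, 60, 90, 150, 300, 700]
round_2 = pull_toward_median(round_1, strength=0.5)

for label, estimates in (("round 1", round_1), ("round 2", round_2)):
    spread = interquartile_range(estimates)
    error = abs(statistics.median(estimates) - TRUE_VALUE)
    print(f"{label}: IQR = {spread:.1f}, median error = {error:.1f}")
```

In this sketch the spread shrinks while the median, and hence the error of the group response, stays exactly where it was.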

Conclusions 

The data discussed in the present article leave no other possibility open than for a 
negative evaluation of quantitative Delphi. The main claim of Delphi — to remove the 
negative effects of unstructured, direct interaction — cannot be substantiated. In many 
Delphis a slight increase in accuracy over rounds is found. But this increase can be 
ascribed partly to mere repetition of judgment (possibly by giving judges opportunity to 
contemplate their judgments) and partly to an artifactual consequence of group pressure 
to conformity. A Delphi is extremely efficient in obtaining consensus, but this consensus 
is not based on genuine agreement; rather, it is the result of the same strong group pressure 
to conformity. 

In view of the manifold applications of Delphi all over the world, the negative 
conclusion drawn in the present article may seem surprising. But negative evaluations of 
Delphi have been appearing since the 1960s. In the Rand Corporation study that aroused 
worldwide interest for Delphi [19], Gordon and Helmer used forecasting as a weaker 
term for prediction to indicate the tentative nature of their and related investigations. The 
fame of the study seems to be based more on the quality of the participants (Isaac Asimov, 
Arthur C. Clarke, Carl G. Hempel, and Stephen Toulmin, among others) than on the 
quality of the study itself or its results [99]. Dalkey, next to Gordon and Helmer the main 
developer of Delphi, summed up most of the negative aspects of Delphi, including the 
strong group pressure to conformity induced by statistical feedback of the group response, 
in two articles published in 1968 [16] and 1969 [20]. In 1971, two review articles appeared 
[28, 31] that for the first time prudently concluded that Delphi may not be fit for quan- 
titative application. Not in the least prudent, this conclusion is repeated by Sackman in 
his well-known Delphi Critique [45]. Although several authors [24, 32, 84, 100] tried 
to rebut Sackman, critical articles about Delphi since his vehement attack have not ceased 
to appear [11, 77]. A few review articles favorable to Delphi have also appeared. Riggs 
[33] reviewed three studies and described one experiment in which Delphi was compared 
to other judgment methods. Delphi was the most accurate method in one of the reviewed 
studies, the least accurate in another, and more accurate, though not significantly so, 
in the third. After concluding his own study, finding a nontraditional variant of Delphi 
to be more accurate than unstructured, direct interaction, Riggs concludes, "in spite of 
the devious limitations of the study, the evidence lends credibility to the statement that 
Delphi procedures are superior to conference methods for long-range forecasting." An- 
other favorable review was written by Shneiderman [61]. After reviewing five studies, 
he concludes that "the Delphi on the whole produces somewhat more accurate final results 
than the traditional committee." This conclusion seems not to be based on the reviewed 
studies, because of the five he reviewed, Delphi was the most accurate in two, the least 
accurate in two, and equally accurate to another method in the fifth. 

Partly in response to the strong critique leveled against Delphi in the 1970s, the 
quantitative applications have become dominated by more qualitative applications (see 
the history of Delphi above). The proliferation of qualitative applications has been 
publicized by many authors [10, 26, 30, 32, 101]. They describe Delphi with the metaphor 
"art instead of science." It can be questioned whether the "art instead of science" approach 
can be regarded as a watertight defense against the critique of Delphi's many drawbacks. 
Quantitative and qualitative statements cannot strictly be divided. Ordering priorities or 
signaling future developments implies making assumptions about which social, cultural, 
and political developments will occur and when they will occur. In any judgment that 
has accuracy as a goal or, at least, underlying assumption, accuracy cannot be expelled 
as a necessary criterion for the particular judgment method used. Only the pure decision 
Delphi, where the primary goal is consensus, could perhaps be excluded from the demand 
of the criterion of accuracy. For all other applications, qualitative and quantitative alike, 
Ascher's [102] remark about forecasts holds: "accuracy is an asset because the utilization 
of forecasts requires, at a minimum credibility" (emphasis by the original author). There 
exist no clues that the drawbacks found in quantitative applications of Delphi do not also 
occur in more qualitative applications. With the negative verdict on quantitative Delphi 
applications, it is troubling that "the transition from forecasting and estimation to value 
Delphis has been lightly made without much in the way of evaluation or reanalysis" [27]. 
The problems this creates for Delphi are nicely illustrated in a set of statements taken 
from a review by Murray [26], in which the drawbacks of Delphi are recognized, but 
the qualitative type of Delphi is nevertheless favored: "Delphi is thus not a science but 
an art," which can be used when "nothing better than opinion can be achieved," while 
"the final justification for the technique must be on its usefulness to decision makers." 
Last-resort arguments like this seem, at the least, a questionable justification for the use of Delphi 
when it can be clearly shown that Delphi is in no way superior to other (simpler, faster, 
and cheaper) judgment methods. 

This review was written with the support of a Dutch government grant, aimed at 
investigating the usefulness of judgment methods in the assessment of the toxic effects of 
hazardous chemicals on humans. 